Method of accessing a dial-up service

ABSTRACT

A method of accessing a dial-up service is disclosed. An example method of providing access to a service includes receiving a first speech signal from a user to form a first utterance; recognizing the first utterance using speaker independent speaker recognition; requesting the user to enter a personal identification number; and when the personal identification number is valid, receiving a second speech signal to form a second utterance and providing access to the service.

PRIORITY INFORMATION

The present application is a continuation of U.S. patent application Ser. No. 13/873,638, filed Apr. 30, 2013, which is a continuation of U.S. patent application Ser. No. 13/251,634, filed Oct. 3, 2011, now U.S. Pat. No. 8,433,569, issued Apr. 30, 2013, which is a continuation of U.S. patent application Ser. No. 12/029,091, filed Feb. 12, 2008, now U.S. Pat. No. 8,032,380, issued Oct. 4, 2011, which is a continuation of U.S. patent application Ser. No. 11/004,287, filed on Dec. 3, 2004, now U.S. Pat. No. 7,356,134, issued Apr. 8, 2008, which is a continuation of U.S. patent application Ser. No. 08/863,462, filed on May 27, 1997, now U.S. Pat. No. 6,847,717, issued Jan. 25, 2005, all of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention is related to the field of speech recognition systems and more particularly to a speaker verification method.

BACKGROUND OF THE INVENTION

Speech recognition and speaker verification use similar analysis tools to achieve their goals. An input utterance is first processed to determine its essential characteristics. Typically, input utterances are converted to cepstrum coefficients. A cepstrum is an inverse Fourier transform of the log power spectrum. In a training phase the cepstrum coefficients are saved to form code books for specific utterances. For instance, a code book might have codes for the numerals zero through nine. In speech recognition, an input utterance is compared to the codes (training utterances) in the code book to determine which is most similar. In speech recognition the code is a generalized representation of many people's way of forming an utterance (e.g., “zero”). In speaker verification the codes represent the individual characteristics of the speaker and the verification system tries to determine if a person's code is more similar to an input utterance than an impostor code. As a result the codes in a speaker verification system emphasize individual characteristics, while in a speech recognition system the codes generalize over many individual speakers. Speaker verification has potential applications in a number of voice activated systems, such as banking over the telephone. Unfortunately, present speaker verification systems have not proven reliable enough for these applications.

Thus there exists a need for a dial-up service that can be used with the capabilities of today's speaker verification systems and can profit by the incorporation of advanced speaker verification systems.

SUMMARY OF THE INVENTION

A method of accessing a dial-up service that meets these goals involves the following steps: (a) dialing a service number; (b) speaking a number of digits to form a first utterance; (c) recognizing the digits using speaker independent speaker recognition; (d) when a user has used the dial-up service previously, verifying the user based on the first utterance using a speaker verification system; (e) when the user cannot be verified, requesting the user enter a personal identification number; and (f) when the personal identification number is valid, providing access to the dial-up service.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of a speaker verification system;

FIG. 2 is a flow chart of an embodiment of the steps used to form a speaker verification decision;

FIG. 3 is a flow chart of an embodiment of the steps used to form a code book for a speaker verification decision;

FIG. 4 is a flow chart of an embodiment of the steps used to form a speaker verification decision;

FIG. 5 is a schematic diagram of a dial-up service that incorporates a speaker verification method;

FIG. 6 is a flow chart of an embodiment of the steps used in a dial-up service; and

FIG. 7 is a flow chart of an embodiment of the steps used in a dial-up service.

DETAILED DESCRIPTION OF THE DRAWINGS

Several improvements in speaker verification methods are described and then a dial-up service that can incorporate these improvements is explained. FIG. 1 is a block diagram of an embodiment of a speaker verification system 10. It is important to note that the speaker verification system can be physically implemented in a number of ways. For instance, the system can be implemented as software in a general purpose computer connected to a microphone; or the system can be implemented as firmware in a general purpose microprocessor connected to memory and a microphone; or the system can be implemented using a Digital Signal Processor (DSP), a controller, a memory, and a microphone controlled by the appropriate software. Note that since the process can be performed using software in a computer, a computer readable storage medium containing computer readable instructions can be used to implement the speaker verification method. These various system architectures are apparent to those skilled in the art and the particular system architecture selected will depend on the application.

A microphone 12 receives input speech and converts the sound waves to an electrical signal. A feature extractor 14 analyzes the electrical signal and extracts key features of the speech. For instance, the feature extractor first digitizes the electrical signal. A cepstrum of the digitized signal is then computed to determine the cepstrum coefficients. In another embodiment, a linear predictive analysis is used to find the linear predictive coding (LPC) coefficients. Other feature extraction techniques are also possible.
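The cepstrum computation named above (the inverse Fourier transform of the log power spectrum, per the Background) can be illustrated with a minimal sketch. It assumes a single real-valued frame of digitized samples and uses a naive transform for readability; the function names `dft` and `real_cepstrum` and the `floor` parameter are hypothetical and do not come from the specification.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a sequence of samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Return the inverse DFT of the log power spectrum of one frame.

    `floor` (an assumption of this sketch) avoids log(0) for empty bins.
    """
    n = len(frame)
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in dft(frame)]
    # The log power spectrum of a real signal is real and symmetric,
    # so the inverse transform is real up to rounding error.
    return [
        sum(log_power[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]
```

A production feature extractor would normally apply windowing and a fast transform, and would keep only the first few coefficients of each frame; those choices are outside what the text specifies.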

A switch 16 is shown attached to the feature extractor 14. This switch 16 represents that a different path is used in the training phase than in the verification phase. In the training phase the cepstrum coefficients are analyzed by a code book generator 18. The output of the code book generator 18 is stored in the code book 20. In one embodiment, the code book generator 18 compares samples of the same utterance from the same speaker to form a generalized representation of the utterance for that person. This generalized representation is a training utterance in the code book. The training utterance represents, as an example, the generalized cepstrum coefficients of a user speaking the number “one”. A training utterance could also be a part of speech, a phoneme, or a number like “twenty one” or any other segment of speech. In addition to the registered users' samples, utterances are taken from a group of non-users. These utterances are used to form a composite that represents an impostor code having a plurality of impostor utterances.

In one embodiment, the code book generator 18 determines whether the speaker (users and non-users) is male or female. The male training utterances (male group) are aggregated to determine a male variance vector. The female training utterances (female group) are aggregated to determine a female variance vector. These gender specific variance vectors will be used when calculating a weighted Euclidean distance (measure of closeness) in the verification phase.

In the verification phase the switch 16 connects the feature extractor 14 to the comparator 22. The comparator 22 performs a mathematical analysis of the closeness between a test utterance from a speaker and a training utterance stored in the code book 20 and between the test utterance and an impostor utterance. In one embodiment, a test utterance such as a spoken “one” is compared with the “one” training utterance for the speaker and the “one” impostor utterance. The comparator 22 determines a measure of closeness between the “one” training utterance, the “one” test utterance and the “one” impostor utterance. When the test utterance is closer to the training utterance than the impostor utterance, the speaker is verified as the true speaker. Otherwise the speaker is determined to be an impostor. In one embodiment, the measure of closeness is a modified weighted Euclidean distance. The modification in one embodiment involves using a generalized variance vector instead of an individual variance vector for each of the registered users. In another embodiment, a male variance vector is used for male speakers and a female variance vector is used for female speakers.
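The comparator's preliminary decision can be sketched as follows. The text does not give the exact form of the modified weighted Euclidean distance, so this sketch assumes the common form in which each squared coefficient difference is divided by the corresponding entry of the shared (generalized or gender specific) variance vector. The function names are hypothetical.

```python
def weighted_euclidean_distance(test, reference, variance):
    """Distance between two feature vectors, each dimension scaled by 1/variance.

    Assumed form: sqrt(sum((t_i - r_i)^2 / v_i)). The specification only
    names the measure; this formula is an illustration.
    """
    return sum((t - r) ** 2 / v for t, r, v in zip(test, reference, variance)) ** 0.5


def preliminary_decision(test, training, impostor, variance):
    """Return True (verified) when the test utterance is closer to the
    speaker's training utterance than to the impostor utterance."""
    d_training = weighted_euclidean_distance(test, training, variance)
    d_impostor = weighted_euclidean_distance(test, impostor, variance)
    return d_training < d_impostor
```

For example, `preliminary_decision([1.0, 2.0], [1.0, 2.5], [3.0, 2.0], [1.0, 0.25])` compares a training distance of 1.0 with an impostor distance of 2.0 and reports the speaker as verified. Passing the male or female variance vector as `variance` gives the gender specific variant.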

A decision weighting and combining system 24 uses the measure of closeness to determine if the test utterance is closest to the training utterance or the impostor utterance. When the test utterance is closer to the training utterance than the impostor utterance, a verified decision is made. When the test utterance is not closer to the training utterance than the impostor utterance, an un-verified decision is made. These are preliminary decisions. Usually, the speaker is required to speak several utterances (e.g., “one”, “three”, “five”, “twenty one”). A decision is made for each of these test utterances. Each of the plurality of decisions is weighted and combined to form the verification decision.

The decisions are weighted because not all utterances provide equal reliability. For instance, “one” could provide a much more reliable decision than “eight”. As a result, a more accurate verification decision can be formed by first weighting the decisions based on the underlying utterance. Two weighting methods can be used. One weighting method uses a historical approach. Sample utterances are compared to the training utterances to determine a probability of false alarm PFA (speaker is not impostor but the decision is impostor) and a probability of miss PM (speaker is impostor but the decision is true speaker). The PFA and PM are probabilities of error. These probabilities of error are used to weight each decision. In one embodiment the weighting factors (weight) are described by the equation below:

$a_{i} = \begin{cases} \log\dfrac{1 - P_{Mi}}{P_{FAi}}, & \text{Decision is Verified (True Speaker)} \\ \log\dfrac{P_{Mi}}{1 - P_{FAi}}, & \text{Decision is Not Verified (Impostor)} \end{cases}$

When the sum of the weighted decisions is greater than zero, then the verification decision is a true speaker. Otherwise the verification decision is an impostor.
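The historical weighting and the sum-greater-than-zero rule can be sketched as below. The sketch assumes natural logarithms (the base does not change the sign of the sum) and error probabilities strictly between 0 and 1; the function names and the tuple layout are hypothetical.

```python
import math


def historical_weight(verified, p_miss, p_false_alarm):
    """Weight a_i for one preliminary decision on utterance i.

    verified       -- True if the preliminary decision was "true speaker"
    p_miss         -- P_M for this utterance (impostor accepted)
    p_false_alarm  -- P_FA for this utterance (true speaker rejected)
    """
    if verified:
        return math.log((1.0 - p_miss) / p_false_alarm)
    return math.log(p_miss / (1.0 - p_false_alarm))


def verification_decision(decisions):
    """Combine (verified, p_miss, p_false_alarm) tuples, one per utterance.

    Returns True (true speaker) when the sum of the weights exceeds zero.
    """
    return sum(historical_weight(*d) for d in decisions) > 0.0
```

With these definitions, a verified decision on a reliable utterance (error rates of 0.1) outweighs a rejected decision on an unreliable one (error rates of 0.4), so `verification_decision([(True, 0.1, 0.1), (False, 0.4, 0.4)])` returns `True`.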

The other method of weighting the decisions is based on an immediate evaluation of the quality of the decision. In one embodiment, this is calculated by using a Chi-Squared detector. The decisions are then weighted on the confidence determined by the Chi-Squared detector. In another embodiment, a large sample approximation is used. Thus if the test statistic is t, find b such that χ²(b)=t. Then a decision is an impostor if it exceeds the 1−α quantile of the χ² distribution.

One weighting scheme is shown below:

- 1.5, if b > c_accept
- 1.0, if 1−α ≤ b ≤ c_accept
- −1.0, if c_reject ≤ b ≤ 1−α
- −1.25, if b < c_reject

When the sum of the weighted decisions is greater than zero, then the verification decision is a true speaker. When the sum of the weighted decisions is less than or equal to zero, the decision is an impostor.
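The tiered weighting scheme above can be sketched as a simple lookup followed by the same sum rule. The sketch assumes that `c_reject < 1 - alpha < c_accept`, where `alpha` is the significance level of the quantile in the scheme, and it resolves the overlap at `b == 1 - alpha` in favor of the 1.0 tier; both points are assumptions, as are the function names and the example threshold values.

```python
def confidence_weight(b, alpha, c_accept, c_reject):
    """Map the confidence value b of one decision to its weight."""
    if b > c_accept:
        return 1.5
    if b >= 1.0 - alpha:
        return 1.0
    if b >= c_reject:
        return -1.0
    return -1.25


def combined_decision(b_values, alpha, c_accept, c_reject):
    """Return True (true speaker) when the summed weights exceed zero."""
    total = sum(confidence_weight(b, alpha, c_accept, c_reject) for b in b_values)
    return total > 0.0
```

As a usage example with hypothetical thresholds `alpha=0.05`, `c_accept=0.99` and `c_reject=0.5`, the values `[0.995, 0.7]` sum to 0.5 (true speaker), while `[0.96, 0.2]` sum to −0.25 (impostor).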

In another embodiment, the feature extractor 14 segments the speech signal into voiced sounds and unvoiced sounds. Voiced sounds generally include vowels, while most other sounds are unvoiced. The unvoiced sounds are discarded before the cepstrum coefficients are calculated in both the training phase and the verification phase.

These techniques of weighting the decisions, using gender dependent cepstrums and only using voiced sounds can be combined or used separately in a speaker verification system.

FIG. 2 is a flow chart of an embodiment of the steps used to form a speaker verification decision. The process starts, at step 40, by generating a code book at step 42. The code book has a plurality of training utterances for each of the plurality of speakers (registered users, plurality of people) and a plurality of impostor utterances. The training utterances in one embodiment are the cepstrum coefficients for a particular user speaking a particular utterance (e.g., “one”). The training utterances are generated by a user speaking the utterances. The cepstrum coefficients of each of the utterances are determined to form the training utterances. In one embodiment a speaker is asked to repeat the utterance and a generalization of the two utterances is saved as the training utterance. In another embodiment both utterances are saved as training utterances.

In one embodiment, a data base of male speakers is used to determine a male variance vector and a data base of female speakers is used to determine a female variance vector. In another embodiment, the data bases of male and female speakers are used to form a male impostor code book and a female impostor code book. The gender specific variance vectors are stored in the code book. At step 44, a plurality of test utterances (input set of utterances) from a speaker are received. In one embodiment the cepstrum coefficients of the test utterances are calculated. Each of the plurality of test utterances is compared to the plurality of training utterances for the speaker at step 46. Based on the comparison, a plurality of decisions are formed, one for each of the plurality of training utterances. In one embodiment, the comparison is determined by a Euclidean weighted distance between the test utterance and the training utterance and between the test utterance and an impostor utterance. In another embodiment, the Euclidean weighted distance is calculated with the male variance vector if the speaker is a male or the female variance vector if the speaker is a female. Each of the plurality of decisions is weighted to form a plurality of weighted decisions at step 48. The weighting can be based on historical error rates for the utterance or based on a confidence level (confidence measure) of the decision for the utterance. The plurality of weighted decisions are combined at step 50. In one embodiment the step of combining involves summing the weighted decisions. A verification decision is then made based on the combined weighted decisions at step 52, ending the process at step 54. In one embodiment if the sum is greater than zero, the verification decision is that the speaker is a true speaker; otherwise the speaker is an impostor.

FIG. 3 is a flow chart of an embodiment of the steps used to form a code book for a speaker verification decision. The process starts, at step 70, by receiving an input utterance at step 72. In one embodiment, the input utterances are then segmented into voiced sounds and unvoiced sounds at step 74. The cepstrum coefficients are then calculated using the voiced sounds at step 76. The coefficients are stored as a training utterance for the speaker at step 78. The process then returns to step 72 for the next input utterance, until all the training utterances have been stored in the code book.

FIG. 4 is a flow chart of an embodiment of the steps used to form a speaker verification decision. The process starts, at step 100, by receiving input utterances at step 102. Next, it is determined if the speaker is male or female at step 104. In a speaker verification application, the speaker purports to be someone in particular. If the person purports to be someone that is a male, then the speaker is assumed to be male even if the speaker is a female. The input utterances are then segmented into voiced sounds and unvoiced sounds at step 106. Features (e.g., cepstrum coefficients) are extracted from the voiced sounds to form the test utterances, at step 108. At step 110, the weighted Euclidean distance (WED) is calculated using a generalized male variance vector if the purported speaker is a male. When the purported speaker is a female, the female variance vector is used. The WED is calculated between the test utterance and the training utterance for the speaker and between the test utterance and the male (or female if appropriate) impostor utterance. A decision is formed for each test utterance based on the WED at step 112. The decisions are then weighted based on a confidence level (measure of confidence) determined using a Chi-squared detector at step 114. The weighted decisions are summed at step 116. A verification decision is made based on the sum of the weighted decisions at step 118.

Using the speaker verification decisions discussed above results in an improved speaker verification system that is more reliable than present techniques.

A dial-up service that uses a speaker verification method as described above is shown in FIG. 5. The dial-up service is shown as a banking service. A user dials a service number on their telephone 150. The public switched telephone network (PSTN) 152 then connects the user's phone 150 with a dial-up service computer 154 at a bank 156. The dial-up service need not be located within a bank. The service will be explained in conjunction with the flow chart shown in FIG. 6. The process starts, at step 170, by dialing a service number (communication service address, number) at step 172. The user (requester) is then prompted by the computer 154 to speak a plurality of digits (access code, plurality of numbers, access number) to form a first utterance at step 174. The digits are recognized using speaker independent voice recognition at step 176. When the user has used the dial-up service previously, the user is verified based on the first utterance at step 178. When the user is verified as a true speaker at step 178, access to the dial-up service is allowed at step 180. When the user cannot be verified, the user is requested to input a personal identification number (PIN) at step 182. The PIN can be entered by the user either by speaking the PIN or by entering the PIN on a keypad. At step 184 it is determined if the PIN is valid. When the PIN is not valid, the user is denied access at step 186. When the PIN is valid, the user is allowed access to the service at step 180. Using the above method the dial-up service uses a speaker verification system as a PIN option, but does not deny access to the user if it cannot verify the user.
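The access logic of FIG. 6 (steps 178 through 186) can be sketched as a small decision function. This is an illustration only: the function name, the string results, and the `check_pin` callable are hypothetical, and the sketch assumes that a first-time user is treated like a user who cannot be verified and is therefore sent to the PIN prompt.

```python
def dial_up_access(used_before, voice_verified, check_pin):
    """Decide access for one call.

    used_before     -- True if the user has used the service previously
    voice_verified  -- result of speaker verification on the first utterance
    check_pin       -- callable that requests a PIN (spoken or keyed) and
                       returns True when it is valid; it is only invoked
                       when voice verification does not succeed
    """
    if used_before and voice_verified:
        return "access"  # step 180: verified as a true speaker
    # Step 182: fall back to the PIN instead of denying access outright.
    if check_pin():
        return "access"  # step 180: valid PIN
    return "denied"  # step 186: invalid PIN
```

Taking the PIN check as a callable keeps the prompt lazy, so a caller who is verified by voice is never asked for a PIN, which matches the use of speaker verification as a PIN option.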

FIG. 7 is a flow chart of another embodiment of the steps used in a dial-up service. The process starts, at step 200, by the user speaking an access code to form a plurality of utterances at step 202. At step 204 it is determined if the user has previously accessed the service. When the user has previously used the service, the speaker verification system attempts to verify the user (identity) at step 206. When the speaker verification system can verify the user, the user is allowed access to the system at step 208. When the system cannot verify the user, a PIN is requested at step 210. Note the user can either speak the PIN or enter the PIN on a keypad. At step 212 it is determined if the PIN is valid. When the PIN is not valid the user is denied access at step 214. When the PIN is valid, the user is allowed access at step 208.

When the user has not previously accessed the communication service at step 204, the user is requested to enter a PIN at step 216. At step 218 it is determined if the PIN is valid. When the PIN is not valid, access to the service is denied at step 220. When the PIN is valid the user is asked to speak the access code a second time to form a second utterance (plurality of second utterances) at step 222. The similarity between the first utterance (step 202) and the second utterance is compared to a threshold at step 224. In one embodiment the similarity is calculated using a weighted Euclidean distance. When the similarity is less than or equal to the threshold, the user is asked to speak the access code again at step 222. In this case the second and third utterances would be compared for the required similarity. In practice, the user would not be required to repeat the access code at step 222 more than once or twice and the system would then allow the user access. When the similarity is greater than the threshold, a combination of the two utterances is stored as a reference utterance at step 226. In another embodiment both utterances are stored as reference utterances. Next, access to the service is allowed at step 208. The reference utterance (plurality of reference utterances, reference voiced sounds) is used to verify the user the next time they access the service. Note that the speaker verification part of the access to the dial-up service in one embodiment uses all the techniques discussed for a verification process. In another embodiment the verification process only uses one of the speaker verification techniques. Finally, in another embodiment the access number has a predetermined digit that is selected from a first set of digits (predefined set of digits) if the user is a male. When the user is a female, the predetermined digit is selected from a second set of digits. This allows the system to determine if the user is supposed to be a male or a female. Based on this information, the male variance vector or female variance vector is used in the speaker verification process.
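The first-time enrollment loop of FIG. 7 (steps 222 through 226) can be sketched as follows. Several details are assumptions of this sketch rather than statements of the specification: utterances are represented as equal-length feature vectors, the "combination" of two utterances is taken to be their element-wise average, the similarity measure is supplied by the caller, and the repeat limit is a parameter. All names are hypothetical.

```python
def enroll_reference(first, speak_again, similarity, threshold, max_attempts=3):
    """Collect a reference utterance for a first-time user with a valid PIN.

    first         -- feature vector of the first utterance (step 202)
    speak_again   -- callable returning the next spoken utterance (step 222)
    similarity    -- callable(a, b) returning a score, larger = more alike
    threshold     -- score that consecutive utterances must exceed (step 224)
    max_attempts  -- how many times step 222 may be performed

    Returns the stored reference (step 226), or None if no pair of
    consecutive utterances was similar enough within the limit.
    """
    previous = first
    for _ in range(max_attempts):
        current = speak_again()
        if similarity(previous, current) > threshold:
            # Assumed combination: element-wise average of the matching pair.
            return [(a + b) / 2.0 for a, b in zip(previous, current)]
        previous = current  # next round compares the two latest utterances
    return None
```

When `None` is returned, the caller decides what to do; the text indicates that in practice access would still be allowed after one or two repeats. A similarity based on a weighted Euclidean distance could be passed in by negating the distance so that larger values mean closer utterances.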

Thus there has been described an improved speaker verification method and a service that takes advantage of the speaker verification method. While the invention has been described in conjunction with specific embodiments thereof, it is evident that many alterations, modifications, and variations will be apparent to those skilled in the art in light of the foregoing description. Accordingly, it is intended to embrace all such alterations, modifications, and variations in the appended claims.

What is claimed is:
1. A method comprising: generating, via a processor, a feature coefficient based on a speech signal from a user; comparing the feature coefficient to a user-specific codebook associated with the user, to yield a similarity value, wherein the user-specific codebook utilizes utterances from both the user and a group of non-users; and when the similarity value meets a threshold: adding the speech signal to a database of reference speech signals associated with the user-specific codebook; and adding the feature coefficient to the user-specific codebook.
2. The method of claim 1, further comprising: mixing the speech signal with a second speech signal, to yield a mixed speech signal; and adding the mixed speech signal to the database of reference speech signals.
3. The method of claim 2, wherein the speech signal and the second speech signal are received from the user.
4. The method of claim 1, wherein the feature coefficient is one of a cepstrum coefficient and a linear predictive coding coefficient.
5. The method of claim 1, wherein the threshold is determined using a Chi-squared detector.
6. The method of claim 1, further comprising: when the similarity value does not meet the threshold, requesting the speech signal be repeated.
7. The method of claim 1, further comprising verifying an identity of the user based on the similarity value.
8. A system comprising: a processor; and a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising: generating a feature coefficient based on a speech signal from a user; comparing the feature coefficient to a user-specific codebook associated with the user, to yield a similarity value, wherein the user-specific codebook utilizes utterances from both the user and a group of non-users; and when the similarity value meets a threshold: adding the speech signal to a database of reference speech signals associated with the user-specific codebook; and adding the feature coefficient to the user-specific codebook.
9. The system of claim 8, the computer-readable storage medium having additional instructions stored which, when executed by the processor, cause the processor to perform operations comprising: mixing the speech signal with a second speech signal, to yield a mixed speech signal; and adding the mixed speech signal to the database of reference speech signals.
10. The system of claim 9, wherein the speech signal and the second speech signal are received from the user.
11. The system of claim 8, wherein the feature coefficient is one of a cepstrum coefficient and a linear predictive coding coefficient.
12. The system of claim 8, wherein the threshold is determined using a Chi-squared detector.
13. The system of claim 8, the computer-readable storage medium having additional instructions stored which, when executed by the processor, cause the processor to perform operations comprising: when the similarity value does not meet the threshold, requesting the speech signal be repeated.
14. The system of claim 8, the computer-readable storage medium having additional instructions stored which result in operations comprising verifying an identity of the user based on the similarity value.
15. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising: generating a feature coefficient based on a speech signal from a user; comparing the feature coefficient to a user-specific codebook associated with the user, to yield a similarity value, wherein the user-specific codebook utilizes utterances from both the user and a group of non-users; and when the similarity value meets a threshold: adding the speech signal to a database of reference speech signals associated with the user-specific codebook; and adding the feature coefficient to the user-specific codebook.
16. The computer-readable storage device of claim 15, having additional instructions stored which, when executed by the computing device, cause the computing device to perform operations comprising: mixing the speech signal with a second speech signal, to yield a mixed speech signal; and adding the mixed speech signal to the database of reference speech signals.
17. The computer-readable storage device of claim 16, wherein the speech signal and the second speech signal are received from the user.
18. The computer-readable storage device of claim 15, wherein the feature coefficient is one of a cepstrum coefficient and a linear predictive coding coefficient.
19. The computer-readable storage device of claim 15, wherein the threshold is determined using a Chi-squared detector.
20. The computer-readable storage device of claim 15, having additional instructions stored which, when executed by the computing device, cause the computing device to perform operations comprising: when the similarity value does not meet the threshold, requesting the speech signal be repeated.